UFO Sighting Analysis

Reports of unidentified flying objects have been around for over a century. My goal with this project is to determine if there is any kind of pattern in regards to the sightings of UFOs across the United States and to predict some aspect of future sightings using data collected between the years of 1910 and 2014. Remember that UFO does not equal aliens, but if you want to believe that this data represents alien sightings, I will not stop you!

Setting Up:

library(tidyverse)
library(janitor)
library(leaflet)
library(readxl)
library(easystats)
library(modelr)
library(caret)

Loading the Data:

df <- read_csv("./ufo_data.csv") %>% 
  janitor::clean_names()
glimpse(df)
## Rows: 80,332
## Columns: 11
## $ datetime           <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10/1955…
## $ city               <chr> "san marcos", "lackland afb", "chester (uk/england)…
## $ state              <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct", "al", "…
## $ country            <chr> "us", NA, "gb", "us", "us", "us", "gb", "us", "us",…
## $ shape              <chr> "cylinder", "light", "circle", "circle", "light", "…
## $ duration_seconds   <dbl> 2700, 7200, 20, 20, 900, 300, 180, 1200, 180, 120, …
## $ duration_hours_min <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 hour", …
## $ comments           <chr> "This event took place in early fall around 1949-50…
## $ date_posted        <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/2004"…
## $ latitude           <dbl> 29.88306, 29.38421, 53.20000, 28.97833, 21.41806, 3…
## $ longitude          <dbl> -97.941111, -98.581082, -2.916667, -96.645833, -157…

As we can see above, there is some cleaning to be done before we get started on analyzing any data.

Tidying the Data:

This analysis is only going to focus on UFO sightings within the United States, so the data needs to reflect that.

df <- df %>%
  mutate(country = case_when(
    state %in% c("al", "ak", "az", "ar", "ca", "co", "ct", "de", "dc", "fl", 
                 "ga", "hi", "id", "il", "in", "ia", "ks", "ky", "la", "me", 
                 "md", "ma", "mi", "mn", "ms", "mo", "mt", "ne", "nv", "nh", 
                 "nj", "nm", "ny", "nc", "nd", "oh", "ok", "or", "pa", "ri", 
                 "sc", "sd", "tn", "tx", "ut", "vt", "va", "wa", "wv", "wi", "wy") ~ "us",
    TRUE ~ "other"
  ))

df <- df %>% 
  filter(country == "us")

One would expect the above code to adequately filter submissions but when looking at latitudes and longitudes, there are still some entries that fall outside of the United States. To fix that, I’m going to create some boundaries.

continental_us <- list(
  xmin = -125.0,  # Westernmost point 
  xmax = -66.93457,  # Easternmost point
  ymin = 24.396308,  # Southernmost point
  ymax = 49.384358  # Northernmost point
)

alaska_hawaii <- list(
  xmin = -178.2166,  # Westernmost point
  xmax = -129.9943,  # Easternmost point
  ymin = 18.9117,  # Southernmost point
  ymax = 71.5388  # Northernmost point 
)

continental_us_data <- df %>%
  filter(latitude >= continental_us$ymin & latitude <= continental_us$ymax &
           longitude >= continental_us$xmin & longitude <= continental_us$xmax)

alaska_hawaii_data <- df %>%
  filter(latitude >= alaska_hawaii$ymin & latitude <= alaska_hawaii$ymax &
           longitude >= alaska_hawaii$xmin & longitude <= alaska_hawaii$xmax)

df <- rbind(continental_us_data, alaska_hawaii_data)

Another problem with this data is the shapes of UFOs.

unique(df$shape)
##  [1] "cylinder"  "light"     "circle"    "sphere"    "disk"      "fireball" 
##  [7] "unknown"   "oval"      "other"     "rectangle" "chevron"   "formation"
## [13] "triangle"  "cigar"     NA          "delta"     "changing"  "diamond"  
## [19] "flash"     "egg"       "teardrop"  "cone"      "cross"     "pyramid"  
## [25] "round"     "crescent"  "flare"     "hexagon"   "dome"      "changed"

There are a lot of values that could mean the same thing, so we’re going to group those together.

df <- df %>% 
  mutate(shape = case_when(
  shape %in% c("light", "fireball", "flash", "flare") ~ "light",
  shape %in% c("circle", "sphere", "egg", "oval", "disk", "round") ~ "spherical",
  shape %in% c("cylinder", "rectangle", "cigar") ~ "rectangular",
  shape %in% c("triangle", "pyramid") ~ "triangular",
  is.na(shape) | shape %in% c("unknown", "other", "changing", "changed", "formation") ~ "other",
  shape %in% c("delta", "chevron") ~ "delta",
  shape %in% c("diamond", "hexagon") ~ "diamond",
  shape %in% c("cone", "dome", "crescent") ~ "cone",
  TRUE ~ as.character(shape)))

unique(df$shape)
##  [1] "rectangular" "light"       "spherical"   "other"       "delta"      
##  [6] "triangular"  "diamond"     "teardrop"    "cone"        "cross"

Visualizing the Data:

First, let’s put up a map of all the UFO sightings in the United States:

Which U.S. state/territory has the most UFO sightings?

It makes a lot of sense that California, the most populous state, would have the most UFO sightings.Let’s see if we can find out which state has the highest percentage of UFO sightings compared to population.

Washington, which was second highest in total number of sightings, actually has the highest percentage of sightings to population.

Now let’s figure out what the most common shape of UFO seen in the U.S. is.

most_common_shape <- df %>%
  count(shape) %>%
  arrange(desc(n))

head(most_common_shape, n = 1)
## # A tibble: 1 × 2
##   shape     n
##   <fct> <int>
## 1 light 21356

Not entirely surprisingly, the most common UFO seen is light, with over 21,000 sightings.

What’s the average duration of a UFO sighting?

mean(df$duration_seconds, na.rm = TRUE)
## [1] 5728.398

5727.5 seconds is over an 1.5 hours, which is a pretty long time to watch a UFO fly overhead.

Let’s answer a few more questions: How has the frequency of UFO sightings changed over the years?

Sightings peaked around 2012, before a rapid drop in occurrence.

What season do sightings occur in most?

There are more sightings in the summer than there are in any other season, presumably because people are far more comfortable being outside late at night in warmer temperatures.

We know that summertime has the most sightings, but in what month do the most UFO sightings occur?

sightings_by_month <- df %>% 
  group_by(month) %>% 
  summarize(count = n())

ggplot(sightings_by_month, aes(x = month, y = count, fill = month)) +
  geom_col() +
  labs(title = "UFO Sightings by Month",
       x = "Month",
       y = "Number of Sightings") +
  theme_minimal() +
  theme(legend.position = "none")

Summer was defined as months 6, 7, and 8, which are June, July, and August respectively. Not shockingly, those same three months are have the highest rate of sightings, with July having the highest. There are a lot of reasons why this could be: more drinking, more partying, more fireworks, or just being outside more.

Let’s do something fun! I have a hypothesis that UFO sightings are more common near military bases, so we’ll plot military bases on that UFO sighting map from earlier (in orange):

There are a lot of UFO sightings around military bases, but there are also a lot of UFO sightings in areas with none. We can draw one of two conclusions from this:

  1. There is no visible relationship between UFO sightings and military bases.

  2. The areas with a lot of UFO sightings and no military bases actually do have military bases, but they aren’t listed.

I’ll let you decide which is more likely.

Data Modeling:

Now I’m going to try to answer a question by modeling this data.

What is the duration of a UFO sighting likely to be in the year 2024?

testing <- df %>%
  select(duration_seconds, month, year, state, shape)

mod1 <- glm(data = testing,
            formula = duration_seconds ~ month + year + state + shape)

mod2 <- glm(data = testing,
            formula = duration_seconds ~ month * year + state + shape)

mod3 <- glm(data = testing,
            formula = duration_seconds ~ month * year * state + shape)

compare_performance(mod1, mod2, mod3) %>% plot()

Of the models that the computer has managed to run, models 1 and 3 are equally bad, but better than model 2. For the sake of moving onward, I’ll make the claim that model 3 is the ‘best’ model to answer this question. Below, I’ve taken a subset of the data to put aside as a testing set.

id <- caret::createDataPartition(testing$duration_seconds, p = 0.8, list = FALSE)
train <- testing[id, ]
test <- testing[-id, ]

mod4 <- glm(data = train,
            formula = mod3$formula)

add_predictions(test, mod4) %>% 
  mutate(error = abs(pred - duration_seconds)) %>% 
  pluck('error') %>% 
  summary()
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0      811     2185    17487     3418 52620301
add_predictions(testing, mod4) %>% 
  mutate(error = abs(pred - duration_seconds)) %>% 
  pluck('error') %>% 
  summary()
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0      794     2141    10264     3363 63577905

There’s a wide range of error from this model, and my first hypothesis as to why is that UFO sightings are truly random. Let’s move on to answering our question, though. I’ve created a new UFO sighting, which has taken place in the month of September later this year.

data_2024 <- data.frame(
  month = "9",
  year = 2024,
  state = "ut",
  shape = "light"
)

predict(mod4, data_2024)
##       1 
## 3293.38